Lab 2 Overview
First Half
More Visualisations:
Heatmaps
HexBins vs SquareBins
Choropleths
Descriptive Statistics:
Histograms and looking at the distribution shape.
How to “quick plot” dataframes using ggplot2.
Some methods of determining bin sizes.
Second Half
Revision:
- Any code related questions for R
Let’s get started
Visualisations
Binned plots
Good visualizations vs Bad visualizastions
Personal Checklist for Visualisations and Dashboards:
Your visualisation needs to tell a story.
It should be intepretable without being overly verbose.
The scale and axis need to make sense (and you can assume the reader knows the difference between a normal scale vs log scale).
The choice of visualisation needs to make sense:
- Line plot vs Bar chart with non-numerical categories
- Scatterplot vs Histogram plot to see distribution
- etc
- Choice of colour scheme / alpha / size need to be easy on the eyes.
At the end of the day, even if you think your visualisation is “pretty” or “beautiful”, if a reader cannot understand it, then it is not a good visualisation.
Load libraries
library(dplyr)
library(sf)
library(curl)
library(ggmap)
library(tmap)
library(tmaptools)
library(feather)Load data
#try feather
filepath = "/Volumes/you.y/MAST30034_R/data/df.feather"
df <- read_feather(filepath)
df %>% tail## # A tibble: 6 x 19
## VendorID tpep_pickup_dat… tpep_dropoff_da… passenger_count trip_distance
## <int> <fct> <fct> <int> <dbl>
## 1 2 4/12/15 22:55 4/12/15 23:28 2 12.8
## 2 2 4/12/15 22:55 4/12/15 23:03 1 0.75
## 3 1 4/12/15 22:55 4/12/15 23:08 1 2.4
## 4 1 4/12/15 22:55 4/12/15 23:01 1 0.8
## 5 2 4/12/15 22:55 4/12/15 23:17 1 4.73
## 6 2 4/12/15 22:55 4/12/15 22:59 2 0.8
## # … with 14 more variables: pickup_longitude <dbl>, pickup_latitude <dbl>,
## # RatecodeID <int>, store_and_fwd_flag <fct>, dropoff_longitude <dbl>,
## # dropoff_latitude <dbl>, payment_type <int>, fare_amount <dbl>, extra <dbl>,
## # mta_tax <dbl>, tip_amount <dbl>, tolls_amount <dbl>,
## # improvement_surcharge <dbl>, total_amount <dbl>
Heatmaps
- Excellent tool when visualising geospatial coordinates.
Select the map.
xranges <- range(df$pickup_longitude[!df$pickup_longitude==0])
yranges <- range(df$pickup_latitude[!df$pickup_latitude==0])
xranges## [1] -77.04710 -71.06483
yranges## [1] 37.27044 42.73614
map_big <- get_stamenmap(
rbind(-74.3,40.45,-73.75,40.95),
zoom = 10)
ggmap(map_big)Plot pickup locations
ggmap(map_big) +
geom_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
bins=50) Change the color scale.
ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
bins=50) +
scale_fill_viridis_c(trans = "log10")ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude),
bins=50) +
scale_fill_gradientn(colours = c("darkred", "orange", "yellow", "white"),trans="log10")Filled by discrete factors.
ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
fill = as.factor(passenger_count)),
bins=50) Change the color palette. A good palette makes your story better.
library(RColorBrewer)
display.brewer.all()ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
fill = as.factor(payment_type)),
bins=100) +
scale_fill_brewer(palette = "Dark2")ggmap(map_big) +
stat_bin2d(data = df,
aes(x = pickup_longitude,
y = pickup_latitude,
fill = as.factor(payment_type),
alpha = 0.95),
bins=100) +
scale_fill_brewer(palette = "Dark2")Hexagon bins avoid the visual artefacts sometimes generated by the very regular alignment of geom_bin2d().
Plot HexBin.
ggmap(map_big) +
stat_binhex(data=df,
aes(x = pickup_longitude, y = pickup_latitude),
bins=100) +
coord_cartesian() +
scale_fill_gradient2(trans="log10")ggmap(map_big) +
stat_binhex(data=df,
aes(x = pickup_longitude, y = pickup_latitude),
alpha = 0.9,
bins=100) +
coord_cartesian() +
scale_fill_viridis_c(trans = "log10") +
facet_grid(.~payment_type)Shapefile Overlays
- NOTE: This only applies on the more recent datasets that use zones over coordinates
Requirements:
geopandas
Shapefile Links:
df_yellow = read.csv("/Volumes/you.y/MAST30034_R/data/yellow_tripdata_2019-01.csv",stringsAsFactors = TRUE)
df_yellow %>%
group_by(PULocationID,passenger_count) %>%
summarise(n=sum(passenger_count)) -> df_yellow_summary## `summarise()` regrouping output by 'PULocationID' (override with `.groups` argument)
library(sf)sf = st_read("/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/taxi_zones.shp")## Reading layer `taxi_zones' from data source
## `/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/taxi_zones.shp'
## using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
zone = read.csv("/Volumes/you.y/MAST30034_R/data/shapefile/taxi+_zone_lookup.csv")Basic information in shapefile.
sf = st_read("/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones/")## Reading layer `taxi_zones' from data source
## `/Volumes/you.y/MAST30034_R/data/shapefile/taxi_zones' using driver `ESRI Shapefile'
## Simple feature collection with 263 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1067383 ymax: 272844.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
plot(sf) ** Choropleth with ggplot **
ggplot() +
geom_sf(data = sf, size = 1, color = "black", fill = "cyan1") +
ggtitle("Boundary Plot") +
coord_sf()+
theme_bw()** Choropleth with tmap **
tmap is specifically designed to make creation of thematic maps more convenient. It borrows from thw ggplot syntax and takes care of a lot of the styling and aesthetics. This reduces our amount of code significantly. We only need:
- tm_shape() where we provide the sf object (we could also provide an SpatialPolygonsDataframe)
*tm_polygons() where we set the attribute variable to map, the break style, and a title.
tm_shape(sf)+
tm_polygons("Shape_Area",style="quantile")** Web mapping with leaflet **
library(leaflet) Incorporate the information we have.
df_yellow_summary$LocationID <- df_yellow_summary$PULocationID
full_join(sf,df_yellow_summary,by="LocationID") -> sf_fullWhile tmap was tolerant about our AEA projection of philly_crimes_sf, leaflet does require us to explicitly reproject the sf object.
# reproject
sf_WGS84 <- st_transform(sf_full, 4326)
leaflet(sf_WGS84) %>%
addPolygons()sf_WGS84 %>% filter(!is.na(n)) ->sf_plot
pal_fun <- colorQuantile("YlOrRd", NULL, n = 5)
p_popup <- paste0("<strong>Passanger number: </strong>", sf_plot$n)
leaflet(sf_plot) %>%
addPolygons(
stroke = FALSE,
fillColor = ~pal_fun(n),
fillOpacity = 0.8, smoothFactor = 0.5,
popup = p_popup) %>%
addTiles() %>%
addLegend("bottomright", # location
pal=pal_fun, # palette function
values=~n, # value to be passed to palette function
title = 'Passanger number') # legend titleDescriptive Statistics
filepath = "/Volumes/you.y/MAST30034_R/data/df.feather"
df <- read_feather(filepath)summary(df)## VendorID tpep_pickup_datetime tpep_dropoff_datetime
## Min. :1.000 4/12/15 19:14: 508 4/12/15 21:00: 487
## 1st Qu.:1.000 4/12/15 22:44: 506 4/12/15 21:40: 486
## Median :2.000 4/12/15 21:57: 503 4/12/15 20:14: 484
## Mean :1.528 4/12/15 22:46: 491 4/12/15 20:58: 480
## 3rd Qu.:2.000 4/12/15 19:39: 489 4/12/15 21:01: 480
## Max. :2.000 4/12/15 19:28: 486 4/12/15 20:31: 476
## (Other) :97017 (Other) :97107
## passenger_count trip_distance pickup_longitude pickup_latitude
## Min. :0.000 Min. : 0.000 Min. :-77.05 Min. : 0.00
## 1st Qu.:1.000 1st Qu.: 1.030 1st Qu.:-73.99 1st Qu.:40.73
## Median :1.000 Median : 1.710 Median :-73.98 Median :40.75
## Mean :1.704 Mean : 2.827 Mean :-72.88 Mean :40.14
## 3rd Qu.:2.000 3rd Qu.: 3.100 3rd Qu.:-73.97 3rd Qu.:40.77
## Max. :6.000 Max. :91.200 Max. : 0.00 Max. :42.74
##
## RatecodeID store_and_fwd_flag dropoff_longitude dropoff_latitude
## Min. : 1.000 N:99421 Min. :-74.62 Min. : 0.00
## 1st Qu.: 1.000 Y: 579 1st Qu.:-73.99 1st Qu.:40.73
## Median : 1.000 Median :-73.98 Median :40.75
## Mean : 1.029 Mean :-72.93 Mean :40.17
## 3rd Qu.: 1.000 3rd Qu.:-73.96 3rd Qu.:40.77
## Max. :99.000 Max. : 0.00 Max. :41.49
##
## payment_type fare_amount extra mta_tax
## Min. :1.00 Min. :-120.00 Min. :-1.0000 Min. :-0.5000
## 1st Qu.:1.00 1st Qu.: 7.00 1st Qu.: 0.5000 1st Qu.: 0.5000
## Median :1.00 Median : 10.00 Median : 0.5000 Median : 0.5000
## Mean :1.34 Mean : 12.75 Mean : 0.6049 Mean : 0.4985
## 3rd Qu.:2.00 3rd Qu.: 15.00 3rd Qu.: 0.5000 3rd Qu.: 0.5000
## Max. :4.00 Max. : 500.00 Max. : 1.5000 Max. : 2.5000
##
## tip_amount tolls_amount improvement_surcharge total_amount
## Min. : 0.00 Min. : 0.0000 Min. :-0.3000 Min. :-120.30
## 1st Qu.: 0.00 1st Qu.: 0.0000 1st Qu.: 0.3000 1st Qu.: 9.30
## Median : 1.46 Median : 0.0000 Median : 0.3000 Median : 12.80
## Mean : 1.80 Mean : 0.2386 Mean : 0.2997 Mean : 16.19
## 3rd Qu.: 2.55 3rd Qu.: 0.0000 3rd Qu.: 0.3000 3rd Qu.: 18.36
## Max. :115.00 Max. :24.0000 Max. : 0.3000 Max. : 550.30
##
Remember that not all the data should be interpreted as purely numerical. There may be conclusions you can draw by coincidence if you incorrectly assume data types! For example: * longitude and latitude should be interpreted as geospatial coordinates * payment_type is a discrete category of payment types * trip_distance is non-linear (not a straight line from A to B), but we have no further data on it. It is also in miles. * To avoid misinterpreting the attributes, refer to the data dictionary provided on the TLC website.
Scatterplot plots for Fare / Distance
ggplot(df) +
aes(x=fare_amount,y=trip_distance) +
geom_point()+
theme_bw()General Inference:
- We can visually see that the relationship is relatively linear as you’d expect (more distance generally means more money)
- A fair number of outliers, notably around the 0 distance axis and 0 cost axis. Why might this be the case?
- Why are there negative values?
Histgram plots for Trip Distance
ggplot(df)+
geom_histogram(aes(x=fare_amount),bins = 30) +
theme_bw()- I guess we can kind of see most fares are between 0 - 100.
- Hard to tell where the main distribution is spread around.
- We can enhance this using a log transformation.
df$transform_fare_amount <- log(df$fare_amount)
ggplot(df)+
geom_histogram(aes(x=transform_fare_amount),bins = 30) +
theme_bw() * Take a log transformation to visually see the distribution
How about the distribution for trips under 15 miles?
ggplot(df %>% filter(trip_distance <=15))+
geom_histogram(aes(x=transform_fare_amount),bins = 30) +
theme_bw()df %>%
mutate(trip_dist_15 = ifelse(trip_distance>15,"yes","no")) %>%
ggplot()+
geom_histogram(aes(x=transform_fare_amount,
fill=trip_dist_15),
bins = 30) +
scale_fill_brewer(palette = "Set2") +
theme_bw()cor(df$trip_distance,df$fare_amount,method = "pearson")## [1] 0.899978
library(pheatmap)
#remove characters
pheatmap(cor(df[,-c(1:3,9)]),
cluster_rows = FALSE,
cluster_cols = FALSE)- trip_distance highly correlates with high tips, tolls and overall trip amount
- payment_type seems to have some form of negative correlation with tip_amount. Must be careful as this is a discrete category.
cors =c("passenger_count", "trip_distance", "fare_amount", "extra",
"mta_tax", "tip_amount", "tolls_amount", "improvement_surcharge", "total_amount")
cor(df[,cors])## passenger_count trip_distance fare_amount extra
## passenger_count 1.0000000000 0.001658362 0.0009505592 -0.003093358
## trip_distance 0.0016583622 1.000000000 0.8999780236 -0.223323624
## fare_amount 0.0009505592 0.899978024 1.0000000000 -0.207022734
## extra -0.0030933583 -0.223323624 -0.2070227338 1.000000000
## mta_tax 0.0131316387 -0.027817098 -0.1282543385 0.130305153
## tip_amount -0.0111149872 0.530623359 0.5573861481 -0.100025231
## tolls_amount 0.0065090920 0.572059235 0.5307124782 -0.157824284
## improvement_surcharge 0.0040463644 0.015943787 0.0489414626 0.095486378
## total_amount -0.0007104667 0.898539215 0.9824415224 -0.186521180
## mta_tax tip_amount tolls_amount
## passenger_count 0.01313164 -0.01111499 0.006509092
## trip_distance -0.02781710 0.53062336 0.572059235
## fare_amount -0.12825434 0.55738615 0.530712478
## extra 0.13030515 -0.10002523 -0.157824284
## mta_tax 1.00000000 -0.06226707 -0.094005581
## tip_amount -0.06226707 1.00000000 0.359383270
## tolls_amount -0.09400558 0.35938327 1.000000000
## improvement_surcharge 0.59152689 0.01722265 0.004483847
## total_amount -0.12168476 0.68568923 0.603590674
## improvement_surcharge total_amount
## passenger_count 0.004046364 -0.0007104667
## trip_distance 0.015943787 0.8985392152
## fare_amount 0.048941463 0.9824415224
## extra 0.095486378 -0.1865211796
## mta_tax 0.591526892 -0.1216847619
## tip_amount 0.017222652 0.6856892276
## tolls_amount 0.004483847 0.6035906745
## improvement_surcharge 1.000000000 0.0486015276
## total_amount 0.048601528 1.0000000000
pheatmap(cor(df[,cors]))Use Spark in R
See sparklyr tutorial.